Project Name: Social Media Sentiment Analysis¶
Dataset: Social Media Sentiment Analysis Dataset
This project aims to analyze user-generated content across various social media platforms to uncover sentiment trends and user behavior. The dataset offers a rich source of data, including text-based content, user sentiments, timestamps, hashtags, user engagement metrics (likes and retweets), and geographical information. By exploring this data, we can identify how emotions fluctuate across time, platforms, and geography. We will also investigate the correlation between popular content and user engagement metrics.
Problem Statement: The primary goal is to perform sentiment analysis, investigate temporal and geographical trends in user-generated content, and analyze platform-specific user behavior. The project will focus on identifying popular topics through hashtags, exploring engagement levels, and understanding regional differences in sentiment trends.
Tasks:
- Dataset Exploration:
- Gain familiarity with the dataset by understanding its structure and key features such as sentiment, timestamps, and user engagement (likes and retweets).
- Sentiment Analysis:
- Conduct sentiment analysis to classify the user-generated content into different categories such as surprise, excitement, admiration, etc.
- Visualize the distribution of sentiments and examine the emotional landscape of social media platforms.
- Temporal Analysis:
- Explore temporal patterns in user sentiment over time using the "Timestamp" column.
- Identify recurring themes, seasonal variations, or any significant trends in the data.
- User Engagement Insights:
- Analyze user engagement by studying the likes and retweets associated with posts.
- Investigate how sentiment correlates with higher levels of user engagement.
- Platform-Specific Analysis:
- Compare sentiment trends across various platforms using the "Platform" column.
- Identify how emotions differ depending on the platform.
- Hashtag and Topic Trends:
- Explore trending topics by analyzing the hashtags.
- Investigate the relationship between hashtags and user engagement or sentiment.
- Geographical Trends:
- Examine regional sentiment variations using the "Country" column.
- Understand how social media content and sentiment differ across various regions.
- Cross-Feature Analysis:
- Combine features (e.g., sentiment and hashtags, sentiment and platform) to uncover deeper insights about user behavior and content trends.
- Predictive Modeling (Optional):
- Explore building predictive models for user engagement (likes/retweets) based on sentiment, hashtags, and platform.
- Evaluate the performance of the model and explore its potential for predicting popular content.
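For the optional modeling task, a minimal sketch of one possible TF-IDF + classifier route with scikit-learn. The tiny corpus and engagement labels below are synthetic, for illustration only; the real task would use the dataset's Text column and a binned engagement label.

```python
# Sketch: predict a coarse engagement class from post text via TF-IDF
# features and logistic regression. All data here is synthetic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Loving this sunny day at the beach",
    "Terrible traffic again this morning",
    "Just finished a great workout",
    "So disappointed with the service today",
]
engagement = ["high", "low", "high", "low"]  # toy labels, not from the dataset

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, engagement)
print(model.predict(["What a wonderful day"]))
```

On real data, `train_test_split` and `classification_report` (already imported below) would replace this in-sample fit.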
Students are encouraged to draw connections between data-driven insights and potential policy implications. The project should foster a deeper understanding of the dynamics of social media sentiment and its relationship to user behavior across platforms and regions.
Import Libraries¶
# System Utilities
import os
import sys
import re
import math
import random
import colorsys
import shutil
import zipfile
import subprocess
from pathlib import Path
import warnings
# Data Handling
import pandas as pd
import numpy as np
from IPython.display import display # For DataFrame display
# Visualization
import matplotlib.pyplot as plt
from matplotlib import font_manager, rcParams
import plotly.express as px
import plotly.graph_objects as go
# WordCloud
from PIL import Image, ImageFont
from wordcloud import WordCloud, STOPWORDS
import jieba
from collections import Counter
# Machine Learning
from sklearn.model_selection import train_test_split, RandomizedSearchCV
# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
# Classification Models
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
# Metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)
import kagglehub # To download dataset from KaggleHub
Dataset acquisition¶
# Install needed packages
%pip -q install kagglehub pandas matplotlib scikit-learn nltk
Note: you may need to restart the kernel to use updated packages.
# Download dataset latest version
# path = kagglehub.dataset_download("kashishparmar02/social-media-sentiments-analysis-dataset")
cache = Path(kagglehub.dataset_download("kashishparmar02/social-media-sentiments-analysis-dataset"))
print("KaggleHub cache:", cache)
# Prepare and clear ./data folder
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
for p in data_dir.iterdir():
if p.is_file():
p.unlink() # remove file
else:
shutil.rmtree(p)
# Collect .csv files
csv_found = []
for f in cache.rglob("*.csv"):
dst = data_dir / f.name
if not dst.exists(): # Avoid duplicates
shutil.copy2(f, dst)
csv_found.append(dst.name)
# If none found, scan all zips and extract ONLY .csv files into ./data
if not csv_found:
for z in cache.rglob("*.zip"):
try:
with zipfile.ZipFile(z) as zf:
for member in zf.infolist():
# Filter by extension .csv
name = Path(member.filename).name
if name.lower().endswith(".csv"):
with zf.open(member) as src, open(data_dir / name, "wb") as dst:
shutil.copyfileobj(src, dst)
csv_found.append(name)
except Exception as e:
print("Skip bad zip:", z, "->", e)
# 5) Verify result and show summary
if not csv_found:
raise FileNotFoundError("No .csv found.")
print("Path to dataset files:", data_dir.resolve(), ", the dataset file name is:", csv_found)
KaggleHub cache: C:\Users\yujua\.cache\kagglehub\datasets\kashishparmar02\social-media-sentiments-analysis-dataset\versions\3
Path to dataset files: C:\Users\yujua\Desktop\F25\DataScience_BootCamp\SenseSmart\data , the dataset file name is: ['sentimentdataset.csv']
Load Data & Column Standardization¶
- Text — the post text
- Sentiment — emotion label (e.g., Positive, Negative, Neutral, ...)
- Timestamp — time the post was made
- Platform — social platform name
- Likes — number of likes
- Retweets — number of retweets
- Country — country string
- Hashtags — raw hashtag text
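If Timestamp arrives as raw strings, it can be coerced to datetime before any temporal analysis; unparseable values become NaT instead of raising. A minimal sketch (the two-row frame is illustrative):

```python
import pandas as pd

# Sketch: coerce a raw Timestamp string column to datetime so that
# components such as the hour can be derived directly.
sample = pd.DataFrame({"Timestamp": ["2023-01-15 12:30:00", "not a date"]})
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], errors="coerce")
print(sample["Timestamp"].dt.hour)  # bad rows come back as NaN
```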
# Data directory ./data
DATA_DIR = Path("data")
# Pick the dataset file in ./data
csv_files = sorted(DATA_DIR.glob("*.csv"))
if not csv_files:
raise FileNotFoundError("No data file found in ./data.")
DATA_FILE = csv_files[0]
print("Selected file:", DATA_FILE.name)
# Read the CSV
df = pd.read_csv(DATA_FILE, low_memory=False)
print("Raw shape:", df.shape)
print("Raw columns:", list(df.columns))
# Normalize original column names to lowercase + strip for matching
# df = df.rename(columns={c: str(c).lower().strip() for c in df.columns})
# Drop index-duplicates: Unnamed
unnamed_cols = [c for c in df.columns if re.match(r"^Unnamed", str(c), flags=re.IGNORECASE)]
if unnamed_cols:
df = df.drop(columns=unnamed_cols)
print("Dropped Unnamed columns:", unnamed_cols)
# Preview
desired_order = ["Text", "Sentiment", "Timestamp", "User",
"Platform", "Hashtags", "Retweets", "Likes",
"Country", "Year", "Month", "Day", "Hour"
]
if all(col in df.columns for col in desired_order):
view = df[desired_order]
else:
present = [c for c in desired_order if c in df.columns]
others = [c for c in df.columns if c not in present]
view = df[present + others]
print("\nPreview:")
display(view.head())
print("Final shape:", view.shape)
print("Final columns:", list(view.columns))
Selected file: sentimentdataset.csv
Raw shape: (732, 15)
Raw columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour']
Dropped Unnamed columns: ['Unnamed: 0.1', 'Unnamed: 0']
Preview:
| | Text | Sentiment | Timestamp | User | Platform | Hashtags | Retweets | Likes | Country | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Enjoying a beautiful day at the park! ... | Positive | 2023-01-15 12:30:00 | User123 | | #Nature #Park | 15.0 | 30.0 | USA | 2023 | 1 | 15 | 12 |
| 1 | Traffic was terrible this morning. ... | Negative | 2023-01-15 08:45:00 | CommuterX | | #Traffic #Morning | 5.0 | 10.0 | Canada | 2023 | 1 | 15 | 8 |
| 2 | Just finished an amazing workout! 💪 ... | Positive | 2023-01-15 15:45:00 | FitnessFan | | #Fitness #Workout | 20.0 | 40.0 | USA | 2023 | 1 | 15 | 15 |
| 3 | Excited about the upcoming weekend getaway! ... | Positive | 2023-01-15 18:20:00 | AdventureX | | #Travel #Adventure | 8.0 | 15.0 | UK | 2023 | 1 | 15 | 18 |
| 4 | Trying out a new recipe for dinner tonight. ... | Neutral | 2023-01-15 19:55:00 | ChefCook | | #Cooking #Food | 12.0 | 25.0 | Australia | 2023 | 1 | 15 | 19 |
Final shape: (732, 13)
Final columns: ['Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour']
Platform Sentiment Bias Analysis¶
pd.set_option('display.show_dimensions', True) # Default is 'truncate'; True always shows the DataFrame dimensions line
# 1) Basic validation: ensure the 'Platform' and 'Sentiment' columns exist
required_cols = {'Platform', 'Sentiment'}
missing = required_cols - set(df.columns)
assert not missing, f"Missing required columns: {missing}"
# Copy and cleaning
df = df.copy()
df['Platform'] = df['Platform'].astype(str).str.strip()
df['Sentiment'] = df['Sentiment'].astype(str).str.strip()
# Drop rows with blank critical fields
df = df[(df['Platform'] != '') & (df['Sentiment'] != '')].reset_index(drop=True)
# Normalize platform names ("twitter"->"Twitter")
df['Platform'] = df['Platform'].str.lower().str.capitalize()
# 2) Map fine-grained sentiment labels to three classes with VADER
# compound ∈ [-1, 1]: Positive if > 0.05, Negative if < -0.05, otherwise Neutral (VADER's suggested thresholds)
import nltk
nltk.download('vader_lexicon', quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer
_vader = SentimentIntensityAnalyzer()
def map_to_polarity(text: str, pos_th: float = 0.05, neg_th: float = -0.05) -> str:
# Map phrases to 3 classes using VADER compound score: Positive / Negative / Neutral
if not isinstance(text, str) or text.strip() == '':
return 'Neutral' # Empty is Neutral
comp = _vader.polarity_scores(text)['compound']
if comp > pos_th:
return 'Positive'
elif comp < neg_th:
return 'Negative'
else:
return 'Neutral'
df['Sentiment_Category'] = df['Sentiment'].apply(map_to_polarity)  # note: scores the sentiment label itself, not the post text
print("Mapping preview:")
display(df[['Sentiment', 'Sentiment_Category']].head(100))
# List terms final class for check
def _clean_term(s: str) -> str:
return str(s).strip()
pos_terms_vc = (
df.loc[df['Sentiment_Category'] == 'Positive', 'Sentiment']
.map(_clean_term)
.value_counts()
)
neg_terms_vc = (
df.loc[df['Sentiment_Category'] == 'Negative', 'Sentiment']
.map(_clean_term)
.value_counts()
)
neu_terms_vc = (
df.loc[df['Sentiment_Category'] == 'Neutral', 'Sentiment']
.map(_clean_term)
.value_counts()
)
print("\n🟢 Positive terms")
display(pos_terms_vc.to_frame('count').reset_index().rename(columns={'index': 'term'}))
print("\n🔴 Negative terms")
display(neg_terms_vc.to_frame('count').reset_index().rename(columns={'index': 'term'}))
print("\n⚪ Neutral terms")
display(neu_terms_vc.to_frame('count').reset_index().rename(columns={'index': 'term'}))
# Export complete file
pos_terms_vc.to_csv("classified_positive_terms.csv")
neg_terms_vc.to_csv("classified_negative_terms.csv")
neu_terms_vc.to_csv("classified_neutral_terms.csv")
print("Save as classified_positive_terms.csv / classified_negative_terms.csv / classified_neutral_terms.csv")
Mapping preview:
| | Sentiment | Sentiment_Category |
|---|---|---|
| 0 | Positive | Positive |
| 1 | Negative | Negative |
| 2 | Positive | Positive |
| 3 | Positive | Positive |
| 4 | Neutral | Neutral |
| ... | ... | ... |
| 95 | Confusion | Negative |
| 96 | Excitement | Positive |
| 97 | Kind | Positive |
| 98 | Pride | Positive |
| 99 | Shame | Negative |
100 rows × 2 columns
🟢 Positive terms
| | Sentiment | count |
|---|---|---|
| 0 | Positive | 45 |
| 1 | Joy | 44 |
| 2 | Excitement | 37 |
| 3 | Contentment | 19 |
| 4 | Gratitude | 18 |
| ... | ... | ... |
| 75 | Thrilling Journey | 1 |
| 76 | Creative Inspiration | 1 |
| 77 | Runway Creativity | 1 |
| 78 | Ocean's Freedom | 1 |
| 79 | Relief | 1 |
80 rows × 2 columns
🔴 Negative terms
| | Sentiment | count |
|---|---|---|
| 0 | Despair | 11 |
| 1 | Grief | 9 |
| 2 | Loneliness | 9 |
| 3 | Sad | 9 |
| 4 | Embarrassed | 8 |
| 5 | Confusion | 8 |
| 6 | Frustration | 6 |
| 7 | Melancholy | 6 |
| 8 | Regret | 6 |
| 9 | Indifference | 6 |
| 10 | Hate | 6 |
| 11 | Bad | 6 |
| 12 | Numbness | 6 |
| 13 | Disgust | 5 |
| 14 | Bitterness | 5 |
| 15 | Frustrated | 5 |
| 16 | Betrayal | 5 |
| 17 | Negative | 4 |
| 18 | Boredom | 4 |
| 19 | Heartbreak | 3 |
| 20 | Jealousy | 3 |
| 21 | Resentment | 3 |
| 22 | Shame | 3 |
| 23 | Bitter | 3 |
| 24 | Devastated | 3 |
| 25 | Envious | 3 |
| 26 | Fearful | 3 |
| 27 | Jealous | 3 |
| 28 | Sadness | 2 |
| 29 | Fear | 2 |
| 30 | Anger | 2 |
| 31 | Disappointed | 2 |
| 32 | Loss | 2 |
| 33 | Helplessness | 2 |
| 34 | Intimidation | 2 |
| 35 | Anxiety | 2 |
| 36 | Envy | 2 |
| 37 | Isolation | 2 |
| 38 | Disappointment | 2 |
| 39 | Sorrow | 2 |
| 40 | Bittersweet | 1 |
| 41 | Darkness | 1 |
| 42 | Exhaustion | 1 |
| 43 | Suffering | 1 |
| 44 | Desperation | 1 |
| 45 | Pressure | 1 |
| 46 | Ruins | 1 |
| 47 | Obstacle | 1 |
48 rows × 2 columns
⚪ Neutral terms
| | Sentiment | count |
|---|---|---|
| 0 | Neutral | 18 |
| 1 | Curiosity | 16 |
| 2 | Serenity | 15 |
| 3 | Nostalgia | 11 |
| 4 | Awe | 9 |
| ... | ... | ... |
| 58 | Imagination | 1 |
| 59 | Mesmerizing | 1 |
| 60 | Winter Magic | 1 |
| 61 | Celestial Wonder | 1 |
| 62 | Whispers of the Past | 1 |
63 rows × 2 columns
Save as classified_positive_terms.csv / classified_negative_terms.csv / classified_neutral_terms.csv
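The groupby/unstack aggregation in the next cell can equivalently be expressed with `pd.crosstab`, which also normalizes to per-platform ratios in one call. A sketch on toy data standing in for the real columns:

```python
import pandas as pd

# Sketch: crosstab with normalize='index' yields per-platform sentiment
# ratios directly, one row per platform summing to 1.
toy = pd.DataFrame({
    "Platform": ["Twitter", "Twitter", "Facebook", "Facebook"],
    "Sentiment_Category": ["Positive", "Negative", "Positive", "Positive"],
})
ratios = pd.crosstab(toy["Platform"], toy["Sentiment_Category"], normalize="index")
print(ratios)
```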
# 3) Aggregate counts & ratios
# Group by Platform and Sentiment_Category to count posts per sentiment
count_tbl = (
df.groupby(['Platform', 'Sentiment_Category'], observed=True) # observed=True to avoid FutureWarning
.size()
.unstack(fill_value=0) # Expand the column into Positive/Negative/Neutral, and fill in 0 for missing values.
)
# Ensure all 3 sentiment columns exist
for col in ['Positive', 'Negative', 'Neutral']:
if col not in count_tbl.columns:
count_tbl[col] = 0
# Totals and ratios
count_tbl['Total'] = count_tbl[['Positive', 'Negative', 'Neutral']].sum(axis=1)
safe_total = count_tbl['Total'].replace(0, np.nan) # avoid division by zero
count_tbl['Positive_Ratio'] = count_tbl['Positive'] / safe_total
count_tbl['Negative_Ratio'] = count_tbl['Negative'] / safe_total
count_tbl['Neutral_Ratio'] = count_tbl['Neutral'] / safe_total
# 4) Correctness checks
# Check that counts add up to Total
_num_ok = (count_tbl['Positive'] + count_tbl['Negative'] + count_tbl['Neutral'] == count_tbl['Total']).all()
assert _num_ok, "Counts don't sum to Total."
# Check that ratios add up to 1
ratio_sum = (count_tbl['Positive_Ratio'] + count_tbl['Negative_Ratio'] + count_tbl['Neutral_Ratio'])
ratio_ok = np.allclose(ratio_sum.dropna().values, np.ones(ratio_sum.dropna().shape[0]), rtol=1e-6, atol=1e-6)
assert ratio_ok, "Ratios don't sum to 1."
print("📊 Counts & Ratios by Platform")
display(count_tbl.sort_index())
# Prepare a reset_index version for plotting
count_reset = count_tbl.reset_index()
# Make sure the platform column is called 'Platform'
if 'Platform' not in count_reset.columns:
count_reset = count_reset.rename(columns={count_reset.columns[0]: 'Platform'})
# 5) Visualization
# Stacked bar for composition
fig_counts = px.bar(
count_reset,
x='Platform',
y=['Positive', 'Negative', 'Neutral'],
title="Sentiment Distribution by Platform (Counts)",
labels={
"value": "Post Count",
"variable": "Sentiment",
"Platform": "Platform"
},
color_discrete_sequence=px.colors.qualitative.Set2,
template='plotly_dark'
)
fig_counts.update_layout(
barmode='stack',
width=900,
height=500,
)
fig_counts.show()
# Grouped Bar: Ratios
ratio_plot = count_reset[['Platform', 'Positive_Ratio', 'Negative_Ratio', 'Neutral_Ratio']].copy()
# Melt into long form: Platform, Sentiment, Ratio
ratio_long = ratio_plot.melt(
id_vars='Platform',
value_vars=['Positive_Ratio', 'Negative_Ratio', 'Neutral_Ratio'],
var_name='Sentiment',
value_name='Ratio'
)
# Clean sentiment names for legend (drop "_Ratio")
ratio_long['Sentiment'] = ratio_long['Sentiment'].str.replace('_Ratio', '', regex=False)
fig_ratio = px.bar(
ratio_long,
x='Platform',
y='Ratio',
color='Sentiment',
text='Ratio',
barmode='group',
title="Platform Sentiment Bias (Ratios)",
labels={
"Platform": "Platform",
"Ratio": "Ratio",
"Sentiment": "Sentiment"
},
template='plotly_dark',
color_discrete_sequence=px.colors.qualitative.Set2
)
# Show text as percentage
fig_ratio.update_traces(
texttemplate='%{text:.1%}',
textposition='outside'
)
ymax = ratio_long['Ratio'].max() * 1.2
fig_ratio.update_yaxes(
range=[0, ymax],
tickformat=".0%"
)
fig_ratio.update_layout(
width=900,
height=500,
)
fig_ratio.show()
# 6) Text Summary
print("📑 Summary")
for plat, row in count_tbl.iterrows():
pos = float(row['Positive_Ratio']) if not math.isnan(row['Positive_Ratio']) else 0.0
neg = float(row['Negative_Ratio']) if not math.isnan(row['Negative_Ratio']) else 0.0
neu = float(row['Neutral_Ratio']) if not math.isnan(row['Neutral_Ratio']) else 0.0
if pos > max(neg, neu):
bias = "🟢 Positive-leaning"
elif neg > max(pos, neu):
bias = "🔴 Negative-leaning"
elif neu > max(pos, neg):
bias = "⚪ Neutral-leaning"
else:
bias = "⚖️ No clear bias"
print(f"{plat}: Positive {pos:.1%}, Negative {neg:.1%}, Neutral {neu:.1%} → {bias}")
📊 Counts & Ratios by Platform
| Sentiment_Category | Negative | Neutral | Positive | Total | Positive_Ratio | Negative_Ratio | Neutral_Ratio |
|---|---|---|---|---|---|---|---|
| Platform | | | | | | | |
| Facebook | 56 | 50 | 125 | 231 | 0.541126 | 0.242424 | 0.216450 |
| Instagram | 62 | 62 | 134 | 258 | 0.519380 | 0.240310 | 0.240310 |
| Twitter | 65 | 59 | 119 | 243 | 0.489712 | 0.267490 | 0.242798 |
3 rows × 7 columns
📑 Summary
Facebook: Positive 54.1%, Negative 24.2%, Neutral 21.6% → 🟢 Positive-leaning
Instagram: Positive 51.9%, Negative 24.0%, Neutral 24.0% → 🟢 Positive-leaning
Twitter: Positive 49.0%, Negative 26.7%, Neutral 24.3% → 🟢 Positive-leaning
Sentiment vs Engagement (Retweets & Likes)¶
# Use VADER-based sentiment classification: Positive / Negative / Neutral
# Clean the sentiment category column
df['Sentiment_Category'] = df['Sentiment_Category'].astype(str).str.strip()
# Keep only the 3 VADER sentiment categories
sent_order = ['Positive', 'Negative', 'Neutral']
sent3_df = df[df['Sentiment_Category'].isin(sent_order)].copy()
sent3_df['Sentiment_3'] = pd.Categorical(
sent3_df['Sentiment_Category'],
categories=sent_order,
ordered=True
)
print("Number of valid rows after VADER 3-way classification:", len(sent3_df))
print("Rows per sentiment:")
print(sent3_df['Sentiment_3'].value_counts())
Number of valid rows after VADER 3-way classification: 732
Rows per sentiment:
Sentiment_3
Positive    378
Negative    183
Neutral     171
Name: count, Length: 3, dtype: int64
# Calculate the mean/median/count of Retweets and Likes
sent_engagement = (
sent3_df
.groupby('Sentiment_3', sort=False, observed=True)[['Retweets', 'Likes']] # group by sentiment; aggregate Retweets and Likes
.agg(['mean', 'median', 'count'])
.reindex(sent_order)
.round(2)
)
print("Engagement by Sentiment (Mean / Median / Count)")
display(sent_engagement)
Engagement by Sentiment (Mean / Median / Count)
| Sentiment_3 | Retweets mean | Retweets median | Retweets count | Likes mean | Likes median | Likes count |
|---|---|---|---|---|---|---|
| Positive | 22.89 | 22.0 | 378 | 45.64 | 45.0 | 378 |
| Negative | 17.35 | 18.0 | 183 | 34.63 | 35.0 | 183 |
| Neutral | 22.89 | 22.0 | 171 | 45.69 | 45.0 | 171 |
3 rows × 6 columns
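A side note on presentation: pandas "named aggregation" produces flat, readable column names instead of the two-level header above. A sketch on toy data standing in for `sent3_df`:

```python
import pandas as pd

# Sketch: named aggregation gives flat columns such as Likes_mean
# rather than a (Likes, mean) MultiIndex header.
toy = pd.DataFrame({
    "Sentiment_3": ["Positive", "Positive", "Negative"],
    "Likes": [40, 50, 10],
})
flat = toy.groupby("Sentiment_3").agg(
    Likes_mean=("Likes", "mean"),
    Likes_median=("Likes", "median"),
    Likes_count=("Likes", "count"),
)
print(flat)
```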
# Compute mean Retweets & Likes per sentiment
plot_data = (
sent3_df
.groupby('Sentiment_3', sort=False, observed=True)[['Retweets', 'Likes']]
.mean()
.reindex(sent_order)
.reset_index()
)
display(plot_data)
# Average Retweets
fig_ret = px.bar(
plot_data,
x='Sentiment_3',
y='Retweets',
color='Retweets',
color_continuous_scale='Viridis',
text='Retweets',
title="Average Retweets by Sentiment",
labels={
'Sentiment_3': 'Sentiment',
'Retweets': 'Average Retweets'
},
template='plotly_dark'
)
fig_ret.update_traces(
texttemplate='%{text:.1f}',
textposition='outside'
)
fig_ret.update_yaxes(range=[0, plot_data['Retweets'].max() * 1.2])
fig_ret.update_layout(width=900, height=500)
fig_ret.show()
# Average Likes
fig_like = px.bar(
plot_data,
x='Sentiment_3',
y='Likes',
color='Likes',
color_continuous_scale='Viridis',
text='Likes',
title="Average Likes by Sentiment",
labels={
'Sentiment_3': 'Sentiment',
'Likes': 'Average Likes'
},
template='plotly_dark'
)
fig_like.update_traces(
texttemplate='%{text:.1f}',
textposition='outside'
)
fig_like.update_yaxes(range=[0, plot_data['Likes'].max() * 1.2])
fig_like.update_layout(width=900, height=500)
fig_like.show()
| | Sentiment_3 | Retweets | Likes |
|---|---|---|---|
| 0 | Positive | 22.894180 | 45.642857 |
| 1 | Negative | 17.349727 | 34.633880 |
| 2 | Neutral | 22.894737 | 45.690058 |
3 rows × 3 columns
# Boxplots: distribution of Retweets & Likes by Sentiment
# Retweets boxplot
fig_ret_box = px.box(
sent3_df,
x='Sentiment_3',
y='Retweets',
color='Sentiment_3',
category_orders={'Sentiment_3': sent_order},
title="Retweets Distribution by Sentiment",
labels={
'Sentiment_3': 'Sentiment',
'Retweets': 'Retweets'
},
template='plotly_dark',
color_discrete_sequence=px.colors.qualitative.Set2
)
fig_ret_box.update_layout(
width=900,
height=500
)
fig_ret_box.show()
# Likes boxplot
fig_like_box = px.box(
sent3_df,
x='Sentiment_3',
y='Likes',
color='Sentiment_3',
category_orders={'Sentiment_3': sent_order}, # keep category order consistent
title="Likes Distribution by Sentiment",
labels={
'Sentiment_3': 'Sentiment',
'Likes': 'Likes'
},
template='plotly_dark',
color_discrete_sequence=px.colors.qualitative.Set2
)
fig_like_box.update_layout(
width=900,
height=500
)
fig_like_box.show()
# Detailed statistics table for each sentiment, used to assist in interpreting the boxplots
stats = []
for sentiment in sent_order:
# Extract values for each sentiment
ret_values = sent3_df[sent3_df['Sentiment_3'] == sentiment]['Retweets']
like_values = sent3_df[sent3_df['Sentiment_3'] == sentiment]['Likes']
stats.append({
"Sentiment": sentiment,
# Retweets
"Retweets_Q1": np.percentile(ret_values, 25),
"Retweets_Median": np.median(ret_values),
"Retweets_Q3": np.percentile(ret_values, 75),
"Retweets_Min": ret_values.min(),
"Retweets_Max": ret_values.max(),
"Retweets_SampleSize": len(ret_values),
# Likes
"Likes_Q1": np.percentile(like_values, 25),
"Likes_Median": np.median(like_values),
"Likes_Q3": np.percentile(like_values, 75),
"Likes_Min": like_values.min(),
"Likes_Max": like_values.max(),
"Likes_SampleSize": len(like_values)
})
stats_df = pd.DataFrame(stats).round(2)
print("Detailed Statistical Summary for Sentiment Categories")
display(stats_df)
Detailed Statistical Summary for Sentiment Categories
| | Sentiment | Retweets_Q1 | Retweets_Median | Retweets_Q3 | Retweets_Min | Retweets_Max | Retweets_SampleSize | Likes_Q1 | Likes_Median | Likes_Q3 | Likes_Min | Likes_Max | Likes_SampleSize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Positive | 18.0 | 22.0 | 28.0 | 8.0 | 40.0 | 378 | 35.0 | 45.0 | 55.0 | 15.0 | 80.0 | 378 |
| 1 | Negative | 12.0 | 18.0 | 22.0 | 5.0 | 40.0 | 183 | 25.0 | 35.0 | 45.0 | 10.0 | 80.0 | 183 |
| 2 | Neutral | 18.0 | 22.0 | 28.0 | 10.0 | 40.0 | 171 | 35.0 | 45.0 | 55.0 | 20.0 | 80.0 | 171 |
3 rows × 13 columns
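The whiskers in the boxplots above follow Tukey's rule (1.5 × IQR beyond the quartiles), so the outlier bounds can be recomputed from the Q1/Q3 values in the table. A minimal numpy sketch on toy retweet counts:

```python
import numpy as np

# Sketch: Tukey fences Q1 - 1.5*IQR and Q3 + 1.5*IQR; values outside
# the fences are what px.box draws as individual outlier points.
values = np.array([8, 18, 22, 28, 40, 120])  # toy retweet counts
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```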
Word Cloud¶
# Dependencies
pkgs = ["wordcloud", "jieba", "pillow", "matplotlib", "numpy", "pandas"]
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
print("Installed:", ", ".join(pkgs))
Installed: wordcloud, jieba, pillow, matplotlib, numpy, pandas
# Configurable Parameters
OUT_DIR = "outputs" # Output folder for generated PNGs and top-token CSVs
CSV_FALLBACK_PATH = "/mnt/data/sentimentdataset.csv" # If no global df is found, fall back to reading this CSV
TEXT_COL_HINT = "text" # Preferred name of the text column (case-insensitive)
GROUP_BY_COL = None # If set (e.g., "Platform"), build one cloud per group; else a single cloud.
MASK_PATH = None # Optional mask image path (white=allowed area)
RAND_SEED = 42 # Random seed for reproducibility
IMG_SIZE = (2400, 1400) # Output resolution in pixels: width x height
MAX_WORDS = 500 # Maximum number of words to display
BACKGROUND_COLOR = "white" # Background color (e.g., "black").
RELATIVE_SCALING = 0.5 # Strength of frequency-to-font-size mapping (0~1)
PREFER_HORIZONTAL = 0.9 # Proportion of words drawn horizontally (0~1)
COLORMAP_NAME = None # If using Matplotlib colormap (e.g., "tab20"), set COLOR_FUNC=None and set this
random.seed(RAND_SEED); np.random.seed(RAND_SEED)
# Pre-configure fonts to minimize missing glyph warnings for CJK
rcParams["font.sans-serif"] = ["SimHei","Microsoft YaHei","Arial Unicode MS","DejaVu Sans"]
rcParams["axes.unicode_minus"] = False
# Text cleaning & tokenization
URL = re.compile(r"https?://\S+|www\.\S+", re.I) # strip URLs
AT = re.compile(r"@[A-Za-z0-9_]+") # strip @mentions
HASH = re.compile(r"#") # remove '#'
HTML = re.compile(r"&[A-Za-z]+;") # handle HTML entities
CN_CHAR = re.compile(r"[\u4e00-\u9fff]") # detect CJK chars
EN_ONLY = re.compile(r"[^a-z']+") # keep [a-z'] only
def clean_line(s: str) -> str:
# Basic cleaning: lowercase, strip URL/@/HTML/entities, collapse spaces
s = str(s).lower()
s = URL.sub(" ", s); s = AT.sub(" ", s); s = HASH.sub("", s); s = HTML.sub(" ", s)
return re.sub(r"\s+", " ", s).strip()
def tokenize_mixed(text: str):
if CN_CHAR.search(text):
return [t.strip() for t in jieba.cut(text) if t.strip()]
t = EN_ONLY.sub(" ", text)
return [w for w in t.split() if w]
# Stopwords
EN_STOP = set(STOPWORDS) | {
"rt","amp","im","ive","dont","didnt","doesnt","cant","couldnt","isnt","wasnt",
"arent","werent","youre","youve","youll","theyre","weve","well","hes","shes",
"thats","theres","whats","a","an","the","is","are","was","were","be","been","being",
"i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves",
"he","him","his","himself","she","her","hers","herself","it","its","itself","they","them","their","theirs","themselves",
"this","that","these","those","and","but","if","or","because","as","until","while",
"of","at","by","for","with","about","against","between","into","through","during","before","after",
"above","below","to","from","up","down","in","out","on","off","over","under",
"again","further","then","once","here","there","when","where","why","how","all","any","both","each",
"few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very",
"s","t","can","will","just","don","should","now"
}
STOP = EN_STOP
# Optional mask loader (white pixels = allowed region for words).
def load_mask(mask_path):
if not mask_path:
return None
img = Image.open(mask_path).convert("L") # convert to grayscale
arr = np.array(img)
# Treat bright area as 255 (allowed), others 0
return np.where(arr > 200, 255, 0).astype(np.uint8)
MASK_ARRAY = load_mask(MASK_PATH)
# Resolve the text column in a case-insensitive way
def get_text_series(df, col_hint="text"):
cmap = {c.lower().strip(): c for c in df.columns}
if col_hint.lower() in cmap:
col = cmap[col_hint.lower()]
else:
cands = [c for c in df.columns if "text" in c.lower()]
if not cands:
raise KeyError("Text column not found")
col = cands[0]
return df[col].astype(str).fillna("")
# Color strategy (two options)
def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
# Soft random HLS colors within readable saturation/lightness ranges
h = random.randint(0, 359) / 360.0
s = random.randint(55, 85) / 100.0
l = random.randint(35, 60) / 100.0
r, g, b = colorsys.hls_to_rgb(h, l, s)
return (int(r*255), int(g*255), int(b*255))
# To use a Matplotlib colormap instead (e.g., 'tab20'), set COLORMAP_NAME and leave COLOR_FUNC as None
COLOR_FUNC = color_func if COLORMAP_NAME is None else None
# Main entry: build a word cloud from a pandas Series of text and export PNG
def build_wordcloud_from_series(text_series: pd.Series, out_prefix: str):
cleaned = text_series.map(clean_line)
tokens = []
for line in cleaned:
for t in tokenize_mixed(line):
if t not in STOP and len(t) > 1:
tokens.append(t)
if not tokens:
raise ValueError("No valid terms detected after preprocessing. Please check input or adjust the stopword settings.")
freq = Counter(tokens) # Frequency count
# Persist top-N token frequencies for audit/plots
os.makedirs(OUT_DIR, exist_ok=True)
pd.DataFrame(freq.most_common(300), columns=["token","count"])\
.to_csv(os.path.join(OUT_DIR, f"{out_prefix}_top_tokens.csv"), index=False)
wc = WordCloud(
width=IMG_SIZE[0], height=IMG_SIZE[1],
background_color=BACKGROUND_COLOR,
max_words=MAX_WORDS,
prefer_horizontal=PREFER_HORIZONTAL,
relative_scaling=RELATIVE_SCALING,
mask=MASK_ARRAY,
colormap=COLORMAP_NAME # If COLOR_FUNC is set, recolor below overrides this
).generate_from_frequencies(freq)
# Optional recolor with custom function
if COLOR_FUNC is not None:
wc = wc.recolor(color_func=COLOR_FUNC, random_state=RAND_SEED)
# Export the PNG
out_png = os.path.join(OUT_DIR, f"{out_prefix}.png")
wc.to_file(out_png)
# Preview in notebooks
plt.figure(figsize=(12, 7))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
print(f"✅ Saved image: {out_png}")
return out_png
# Read df if not provided externally
try:
_ = df
except NameError:
if not os.path.exists(CSV_FALLBACK_PATH):
raise FileNotFoundError("No DataFrame detected, and the CSV file could not be found. Please load the DataFrame or update the CSV_FALLBACK_PATH.")
df = pd.read_csv(CSV_FALLBACK_PATH, low_memory=False)
# Run: one overall cloud or per-group multiple clouds
if GROUP_BY_COL:
# Group by the column and build one cloud per group
for key, g in df.groupby(GROUP_BY_COL):
try:
text_s = get_text_series(g, TEXT_COL_HINT)
build_wordcloud_from_series(text_s, out_prefix=f"wordcloud_{GROUP_BY_COL}_{key}")
except Exception as e:
# Some groups may be empty after cleaning; safely skip
print(f"skip {GROUP_BY_COL}={key}: {e}")
else:
text_s = get_text_series(df, TEXT_COL_HINT)
build_wordcloud_from_series(text_s, out_prefix="wordcloud_text")
✅ Saved image: outputs\wordcloud_text.png
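The cleaning and frequency-count steps above can be exercised end-to-end on a single sample post. A self-contained sketch that re-declares simplified versions of the URL/@mention regexes and a tiny stopword set (English-only, no jieba):

```python
import re
from collections import Counter

# Simplified versions of the pipeline's cleaning regexes
URL = re.compile(r"https?://\S+|www\.\S+", re.I)
AT = re.compile(r"@[A-Za-z0-9_]+")
STOP = {"a", "the", "at", "is"}  # tiny illustrative stopword set

post = "Enjoying a beautiful day at the park! #Nature https://example.com @User123"
# Lowercase, strip URLs and @mentions, drop the '#' so hashtags become words
text = AT.sub(" ", URL.sub(" ", post.lower())).replace("#", "")
tokens = [w for w in re.sub(r"[^a-z']+", " ", text).split()
          if w not in STOP and len(w) > 1]
print(Counter(tokens).most_common(3))
```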
Most Active Country¶
# Clean the Country column
df['Country_clean'] = df['Country'].astype(str).str.strip()
# Group by Country_clean to count number of posts per country
country_activity = (
df.groupby('Country_clean')
.size()
.reset_index(name='Num_Posts') # Number of posts per country
.sort_values(by='Num_Posts', ascending=False) # Sort by number of posts from highest to lowest
)
print("Top 5 most active countries by number of posts:")
display(country_activity.head(5))
# Get the most active country for later analysis
top_country = country_activity.iloc[0]['Country_clean']
print("\nMost active country:", top_country)
Top 5 most active countries by number of posts:
| | Country_clean | Num_Posts |
|---|---|---|
| 32 | USA | 188 |
| 31 | UK | 143 |
| 5 | Canada | 135 |
| 0 | Australia | 75 |
| 13 | India | 70 |
5 rows × 2 columns
Most active country: USA
# Top 10 Active Countries (Bar Plot)
# Count posts per cleaned country name
posts_per_country = (
    df['Country_clean']
    .value_counts()
    .reset_index()
)
posts_per_country.columns = ['Country', 'Posts']

# Select the top 10 countries by post count
top10_countries = posts_per_country.head(10)

# Plotly bar chart
fig = px.bar(
    top10_countries,
    x='Country',
    y='Posts',
    text='Posts',
    color='Posts',
    color_continuous_scale='Viridis',
    title='Top 10 Active Countries',
    template='plotly_dark'
)
# Place the count labels outside the bars
fig.update_traces(textposition='outside')
# Add y-axis headroom to prevent label cutoff
fig.update_yaxes(range=[0, top10_countries['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Country",
    yaxis_title="Number of Posts"
)
fig.show()
Peak Activity by Hours in the Most Active Country¶
# Filter rows for the most active country
df_top_country = df[df['Country_clean'] == top_country].copy()

# Count the number of posts per hour
hour_activity = (
    df_top_country.groupby('Hour')
    .size()
    .reset_index(name='Posts')
    .sort_values(by='Posts', ascending=False)
)
print("Top posting hours for:", top_country)
display(hour_activity.head(5))

# Plotly bar chart
fig = px.bar(
    hour_activity.sort_values('Hour'),  # Sort hours ascending
    x='Hour',
    y='Posts',
    text='Posts',
    color='Posts',
    color_continuous_scale='Viridis',
    title=f"Peak Activity by Hour in {top_country}",
    template='plotly_dark'
)
# Position text labels above the bars
fig.update_traces(textposition='outside')
# Add headroom to avoid clipping the text labels
fig.update_yaxes(range=[0, hour_activity['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Hour",
    yaxis_title="Number of Posts"
)
fig.show()
Top posting hours for: USA
| | Hour | Posts |
|---|---|---|
| 8 | 14 | 26 |
| 10 | 16 | 23 |
| 9 | 15 | 21 |
| 14 | 20 | 15 |
| 15 | 21 | 14 |
5 rows × 2 columns
Peak Activity Months in the Most Active Country¶
# df_top_country = df[df['Country_clean'] == top_country].copy()
# Count the number of posts per month
month_activity_raw = (
    df_top_country.groupby('Month')
    .size()
    .reset_index(name='Posts')
)
print("Raw monthly activity (months that appeared in data):")
display(month_activity_raw)

# Reindex to a full January-December range; missing months get 0 posts
all_months = pd.DataFrame({'Month': list(range(1, 13))})
month_activity = all_months.merge(month_activity_raw, on='Month', how='left')
month_activity['Posts'] = month_activity['Posts'].fillna(0).astype(int)
print("\nFull 12-month activity (1–12 months):")
display(month_activity)

month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_activity["Month_Name"] = month_activity["Month"].apply(lambda x: month_names[x - 1])

# Plotly bar chart
fig = px.bar(
    month_activity,
    x='Month_Name',
    y='Posts',
    text='Posts',
    color='Posts',
    color_continuous_scale='Viridis',
    title=f"Peak Activity Months in {top_country}",
    template='plotly_dark'
)
fig.update_traces(textposition='outside')
fig.update_yaxes(range=[0, month_activity['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Month",
    yaxis_title="Number of Posts"
)
fig.show()
Raw monthly activity (months that appeared in data):
| | Month | Posts |
|---|---|---|
| 0 | 1 | 27 |
| 1 | 2 | 24 |
| 2 | 3 | 10 |
| 3 | 4 | 10 |
| 4 | 5 | 10 |
| 5 | 6 | 21 |
| 6 | 7 | 15 |
| 7 | 8 | 21 |
| 8 | 9 | 21 |
| 9 | 10 | 11 |
| 10 | 11 | 8 |
| 11 | 12 | 10 |
12 rows × 2 columns
Full 12-month activity (1–12 months):
| | Month | Posts |
|---|---|---|
| 0 | 1 | 27 |
| 1 | 2 | 24 |
| 2 | 3 | 10 |
| 3 | 4 | 10 |
| 4 | 5 | 10 |
| 5 | 6 | 21 |
| 6 | 7 | 15 |
| 7 | 8 | 21 |
| 8 | 9 | 21 |
| 9 | 10 | 11 |
| 10 | 11 | 8 |
| 11 | 12 | 10 |
12 rows × 2 columns
Posting Frequency by Hour¶
# Count the number of posts per hour (0–23)
hour_counts_raw = (
    df.groupby('Hour')
    .size()
    .reset_index(name='Posts')
)

# Reindex to all 24 hours; hours with no posts get 0
all_hours = pd.DataFrame({'Hour': list(range(24))})
hour_counts = all_hours.merge(hour_counts_raw, on='Hour', how='left')
hour_counts['Posts'] = hour_counts['Posts'].fillna(0).astype(int)
print("Hourly posting frequency (0–23):")
display(hour_counts)

# Plotly bar chart
fig = px.bar(
    hour_counts,
    x='Hour',
    y='Posts',
    text='Posts',
    color='Posts',
    color_continuous_scale='Viridis',
    title="Posting Frequency by Hour",
    template='plotly_dark'
)
fig.update_traces(textposition='outside')
fig.update_yaxes(range=[0, hour_counts['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Hour of Day",
    yaxis_title="Number of Posts"
)
fig.show()
Hourly posting frequency (0–23):
| | Hour | Posts |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 2 | 1 |
| 3 | 3 | 3 |
| 4 | 4 | 0 |
| 5 | 5 | 1 |
| 6 | 6 | 4 |
| 7 | 7 | 7 |
| 8 | 8 | 23 |
| 9 | 9 | 28 |
| 10 | 10 | 30 |
| 11 | 11 | 37 |
| 12 | 12 | 38 |
| 13 | 13 | 30 |
| 14 | 14 | 94 |
| 15 | 15 | 47 |
| 16 | 16 | 69 |
| 17 | 17 | 48 |
| 18 | 18 | 65 |
| 19 | 19 | 75 |
| 20 | 20 | 50 |
| 21 | 21 | 41 |
| 22 | 22 | 33 |
| 23 | 23 | 7 |
24 rows × 2 columns
Engagement Heatmap by Hour and Platform¶
# Define Engagement metric (Retweets + Likes)
# Make sure Retweets / Likes have no NaN to avoid issues when adding
df['Retweets'] = df['Retweets'].fillna(0)
df['Likes'] = df['Likes'].fillna(0)
# Define Engagement as the sum of Retweets and Likes
df['Engagement'] = df['Retweets'] + df['Likes']
print("Engagement column created. Preview:")
df[['Platform', 'Hour', 'Retweets', 'Likes', 'Engagement']].head()
Engagement column created. Preview:
| | Platform | Hour | Retweets | Likes | Engagement |
|---|---|---|---|---|---|
| 0 | | 12 | 15.0 | 30.0 | 45.0 |
| 1 | | 8 | 5.0 | 10.0 | 15.0 |
| 2 | | 15 | 20.0 | 40.0 | 60.0 |
| 3 | | 18 | 8.0 | 15.0 | 23.0 |
| 4 | | 19 | 12.0 | 25.0 | 37.0 |
5 rows × 5 columns
# Engagement Heatmap by Hour and Platform
heat_df = (
    df.groupby(['Platform', 'Hour'], as_index=False)['Engagement']
    .sum()
)

# Pivot to a matrix: rows = Platform, columns = Hour, values = Engagement
heat_pivot = (
    heat_df
    .pivot(index='Platform', columns='Hour', values='Engagement')
    .fillna(0)
)

# Ensure all 24 hour columns exist, even if some hours had no posts
all_hours = list(range(24))
heat_pivot = heat_pivot.reindex(columns=all_hours, fill_value=0)

# Draw the heatmap with Plotly
fig = px.imshow(
    heat_pivot,
    x=heat_pivot.columns,  # 0–23
    y=heat_pivot.index,    # platforms
    color_continuous_scale='Viridis',
    template='plotly_dark',
    labels={
        "x": "Hour of Day",
        "y": "Platform",
        "color": "Engagement"
    },
    title="Engagement Heatmap by Hour and Platform (24-hour)",
)
# Adjust figure size
fig.update_layout(
    height=500,
    width=900,
)
fig.show()
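Raw engagement sums let large platforms dominate the color scale, which can hide the hourly rhythm of smaller ones. An optional refinement is to normalize each row to its own peak before plotting, so every platform tops out at 1.0 and only the timing differs. A sketch on a made-up two-platform matrix (the `heat_pivot` built above could be passed the same way; the platform names and numbers here are illustrative only):

```python
import pandas as pd

# Hypothetical engagement matrix: rows = platforms, columns = hours.
heat_pivot = pd.DataFrame(
    {8: [100, 10], 12: [400, 40], 20: [200, 5]},
    index=["PlatformA", "PlatformB"],
)

# Divide each row by its own maximum so every platform peaks at 1.0.
# This makes *when* each platform is busiest comparable, regardless of size.
heat_norm = heat_pivot.div(heat_pivot.max(axis=1), axis=0)

print(heat_norm.round(2))
# Both rows now peak at 1.0 in the hour-12 column.
```

The normalized frame plugs straight into the same `px.imshow` call; just relabel the color bar as "Engagement (share of platform peak)".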
Machine Learning¶
# Prepare Data
text_col = 'Text'
label_col = 'Sentiment_Category'
valid_labels = ['Positive', 'Negative', 'Neutral']
ml_df = df[[text_col, label_col]].dropna().copy()
ml_df = ml_df[ml_df[label_col].isin(valid_labels)]
print("Sample size:", len(ml_df))
display(ml_df.head())
X_text = ml_df[text_col].astype(str)
y = ml_df[label_col].astype(str)
Sample size: 732
| | Text | Sentiment_Category |
|---|---|---|
| 0 | Enjoying a beautiful day at the park! ... | Positive |
| 1 | Traffic was terrible this morning. ... | Negative |
| 2 | Just finished an amazing workout! 💪 ... | Positive |
| 3 | Excited about the upcoming weekend getaway! ... | Positive |
| 4 | Trying out a new recipe for dinner tonight. ... | Neutral |
5 rows × 2 columns
# Train/Test Split + TF-IDF
# Stratify to preserve class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print("Train size:", len(X_train), " Test size:", len(X_test))

# TF-IDF features: unigrams and bigrams, English stop words removed
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    stop_words='english'
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print("TF-IDF matrix shape:", X_train_tfidf.shape)
Train size: 585  Test size: 147
TF-IDF matrix shape: (585, 5000)
# Train Multiple Models
models = {
    "PassiveAggressive": PassiveAggressiveClassifier(max_iter=50, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": SVC(kernel='linear', random_state=42),
    "MultinomialNB": MultinomialNB()
}
results = {}
for name, clf in models.items():
    print("\n" + "=" * 50)
    print(f"Training model: {name}")
    clf.fit(X_train_tfidf, y_train)
    y_pred = clf.predict(X_test_tfidf)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    results[name] = {"model": clf, "accuracy": acc, "y_pred": y_pred}
==================================================
Training model: PassiveAggressive
PassiveAggressive Accuracy: 0.7143
Classification Report:
precision recall f1-score support
Negative 0.80 0.65 0.72 37
Neutral 0.58 0.41 0.48 34
Positive 0.72 0.88 0.79 76
accuracy 0.71 147
macro avg 0.70 0.65 0.66 147
weighted avg 0.71 0.71 0.70 147
==================================================
Training model: LogisticRegression
LogisticRegression Accuracy: 0.6599
Classification Report:
precision recall f1-score support
Negative 0.94 0.41 0.57 37
Neutral 0.86 0.18 0.29 34
Positive 0.61 1.00 0.76 76
accuracy 0.66 147
macro avg 0.80 0.53 0.54 147
weighted avg 0.75 0.66 0.60 147
==================================================
Training model: RandomForest
RandomForest Accuracy: 0.6463
Classification Report:
precision recall f1-score support
Negative 0.80 0.32 0.46 37
Neutral 0.69 0.26 0.38 34
Positive 0.62 0.97 0.76 76
accuracy 0.65 147
macro avg 0.70 0.52 0.53 147
weighted avg 0.68 0.65 0.60 147
==================================================
Training model: SVM
SVM Accuracy: 0.7279
Classification Report:
precision recall f1-score support
Negative 0.89 0.65 0.75 37
Neutral 0.79 0.32 0.46 34
Positive 0.68 0.95 0.79 76
accuracy 0.73 147
macro avg 0.78 0.64 0.67 147
weighted avg 0.76 0.73 0.70 147
==================================================
Training model: MultinomialNB
MultinomialNB Accuracy: 0.6531
Classification Report:
precision recall f1-score support
Negative 0.94 0.43 0.59 37
Neutral 0.80 0.12 0.21 34
Positive 0.61 1.00 0.76 76
accuracy 0.65 147
macro avg 0.78 0.52 0.52 147
weighted avg 0.74 0.65 0.59 147
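With only 147 test examples, the accuracy gaps between these models are small enough to be within sampling noise. A steadier estimate comes from stratified k-fold cross-validation over a `Pipeline`, which refits TF-IDF inside each fold and so avoids leaking test-fold vocabulary into training. A sketch on a tiny made-up corpus (the notebook's real `X_text` and `y` would be passed the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny illustrative corpus; not the dataset's actual texts.
texts = [
    "great day", "love this", "so happy", "wonderful time",
    "amazing fun", "best ever",
    "terrible day", "hate this", "so sad", "awful time",
    "horrible mess", "worst ever",
]
labels = ["pos"] * 6 + ["neg"] * 6

# Putting the vectorizer inside the pipeline means TF-IDF is refit
# on each training fold, so no test-fold vocabulary leaks into training.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(kernel="linear", random_state=42)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
print("fold accuracies:", scores, " mean:", round(scores.mean(), 3))
```

Reporting the mean and standard deviation across folds makes it clearer whether, say, SVM's lead over PassiveAggressive is real or a quirk of one particular split.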
# Model Accuracy Comparison
# Build the accuracy table
acc_df = pd.DataFrame({
    "Model": list(results.keys()),
    "Accuracy": [results[m]["accuracy"] for m in results]
})
print("\nCompare Model Accuracy")
display(acc_df)

# Bar chart with Plotly
fig = px.bar(
    acc_df,
    x="Model",
    y="Accuracy",
    text="Accuracy",
    color="Accuracy",
    color_continuous_scale="Viridis",
    title="Accuracy Comparison of Models",
    template="plotly_dark"
)
# Format the labels and place them outside the bars
fig.update_traces(
    texttemplate='%{text:.3f}',
    textposition='outside'
)
fig.update_layout(
    xaxis_title="Model",
    yaxis_title="Accuracy",
    yaxis=dict(range=[0, 1]),
    width=900,
    height=500
)
fig.show()
Compare Model Accuracy
| | Model | Accuracy |
|---|---|---|
| 0 | PassiveAggressive | 0.714286 |
| 1 | LogisticRegression | 0.659864 |
| 2 | RandomForest | 0.646259 |
| 3 | SVM | 0.727891 |
| 4 | MultinomialNB | 0.653061 |
5 rows × 2 columns
# Confusion Matrix
# Select the model with the highest accuracy
best_name = max(results, key=lambda x: results[x]["accuracy"])
best_model = results[best_name]["model"]
best_pred = results[best_name]["y_pred"]
print(f"\nBest model selected: {best_name}")

labels_sorted = ['Negative', 'Neutral', 'Positive']

# Compute the confusion matrix
cm = confusion_matrix(y_test, best_pred, labels=labels_sorted)

# Convert to a DataFrame for Plotly
cm_df = pd.DataFrame(cm, index=labels_sorted, columns=labels_sorted)

# Plotly heatmap
fig = px.imshow(
    cm_df,
    text_auto=True,  # Show the counts inside the cells
    color_continuous_scale='Blues',
    title=f"Confusion Matrix (Plotly) - {best_name}",
)
fig.update_layout(
    xaxis_title="Predicted Label",
    yaxis_title="True Label",
    width=900,
    height=600
)
fig.show()
Best model selected: SVM
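The per-class numbers in the classification reports can all be read off a confusion matrix like the one plotted above: with rows as true labels and columns as predictions, recall for a class is its diagonal entry divided by its row sum, and precision is the diagonal entry divided by its column sum. A small NumPy sketch with made-up counts (not the notebook's actual matrix):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true, cols = predicted),
# ordered Negative, Neutral, Positive as in the notebook.
cm = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
])

diag = np.diag(cm).astype(float)
recall = diag / cm.sum(axis=1)     # per class: correct / all true of that class
precision = diag / cm.sum(axis=0)  # per class: correct / all predicted as that class
accuracy = diag.sum() / cm.sum()   # overall: fraction on the diagonal

print("recall:", recall.round(2))        # [0.8  0.6  0.9 ]
print("precision:", precision.round(2))  # [0.8  0.75 0.75]
print("accuracy:", round(accuracy, 3))   # 0.767
```

Reading the plotted matrix this way shows exactly where the best model loses accuracy, e.g. which off-diagonal cell (such as Neutral predicted as Positive) contributes most of the errors.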
# Train Logistic Regression (for feature explanation)
log_clf = LogisticRegression(max_iter=1000, random_state=42)
log_clf.fit(X_train_tfidf, y_train)

# Feature names from the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

# Logistic regression coefficients, one row per class
coefs = log_clf.coef_
class_labels = log_clf.classes_

top_n = 15  # show the 15 strongest feature words per class

# Loop over each sentiment class
for idx, label in enumerate(class_labels):
    coef_for_class = coefs[idx]
    # Words with the highest positive coefficients
    top_pos_idx = np.argsort(coef_for_class)[-top_n:][::-1]
    # Words with the lowest (most negative) coefficients
    top_neg_idx = np.argsort(coef_for_class)[:top_n]

    print("\n=========================================")
    print(f"Sentiment Class: {label}")
    print("\nTop Positive Words (strong indicators)")
    top_pos_words = [feature_names[i] for i in top_pos_idx]
    print(top_pos_words)
    print("\nTop Negative Words (opposite indicators)")
    top_neg_words = [feature_names[i] for i in top_neg_idx]
    print(top_neg_words)
=========================================
Sentiment Class: Negative

Top Positive Words (strong indicators)
['despair', 'loneliness', 'thoughts', 'grief', 'jealousy', 'injustice', 'resentment', 'lingers', 'lost', 'labyrinth', 'bad', 'like', 'heart', 'confusion', 'trust']

Top Negative Words (opposite indicators)
['new', 'beauty', 'laughter', 'excitement', 'exploring', 'serenity', 'friends', 'curiosity', 'nature', 'tales', 'concert', 'sky', 'ancient', 'gratitude', 'surprise']

=========================================
Sentiment Class: Neutral

Top Positive Words (strong indicators)
['serenity', 'curiosity', 'awe', 'knowledge', 'fulfillment', 'reverence', 'nostalgia', 'ambivalence', 'empowerment', 'arousal excitement', 'arousal', 'moonlit', 'mysteries', 'uncertainty', 'tales']

Top Negative Words (opposite indicators)
['day', 'heart', 'despair', 'surprise', 'loneliness', 'gratitude', 'just', 'inspiration', 'weekend', 'hopeful', 'friend', 'grief', 'morning', 'warmth', 'contentment']

=========================================
Sentiment Class: Positive

Top Positive Words (strong indicators)
['new', 'surprise', 'gratitude', 'laughter', 'friends', 'inspiration', 'weekend', 'hopeful', 'joy', 'creativity', 'pride', 'elation', 'contentment', 'euphoria', 'kindness']

Top Negative Words (opposite indicators)
['lost', 'serenity', 'despair', 'shattered', 'thoughts', 'loneliness', 'silent', 'curiosity', 'knowledge', 'echoes', 'night', 'awe', 'emotional', 'fulfillment', 'labyrinth']